Below you will find several empty R code scripts and answer prompts. Your task is to fill in the required code snippets and answer the corresponding questions.
Today, we start by looking at a collection of breakfast cereals:
With variables:
Produce a histogram of the sugar variable.
Now, compute the standard deviation of the variable sugar:
## [1] -1.02631579 0.97368421 -2.02631579 -7.02631579 0.97368421
## [6] 2.97368421 6.97368421 0.97368421 -1.02631579 -2.02631579
## [11] 4.97368421 -6.02631579 1.97368421 -0.02631579 5.97368421
## [16] -4.02631579 -5.02631579 4.97368421 5.97368421 -0.02631579
## [21] -7.02631579 -4.02631579 2.97368421 -2.02631579 5.97368421
## [26] 3.97368421 -0.02631579 2.97368421 4.97368421 4.97368421
## [31] 7.97368421 1.97368421 -2.02631579 -4.02631579 -3.02631579
## [36] 3.97368421 2.97368421 3.97368421 -1.02631579 1.97368421
## [41] -4.02631579 -1.02631579 4.97368421 -4.02631579 3.97368421
## [46] 3.97368421 5.97368421 -1.02631579 1.97368421 -0.02631579
## [51] -5.02631579 2.97368421 6.97368421 -4.02631579 -7.02631579
## [56] -7.02631579 -1.02631579 4.97368421 0.97368421 -1.02631579
## [61] -5.02631579 -4.02631579 -7.02631579 -7.02631579 -7.02631579
## [66] 7.97368421 -4.02631579 -2.02631579 -4.02631579 6.97368421
## [71] -4.02631579 -4.02631579 4.97368421 -4.02631579 -4.02631579
## [76] 0.97368421
What are the units of this measurement?
Answer: grams (the standard deviation has the same units as the variable itself)
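The values printed above look like each cereal's deviation from the mean sugar content rather than a single standard deviation. A minimal sketch of the difference, using a made-up `sugar` vector (an assumption for illustration, not the cereal data):

```r
# Hypothetical sugar values in grams; NOT taken from the cereal dataset.
sugar <- c(6, 8, 5, 0, 8, 10, 14, 8)

# Deviations: one value per observation, centered on the mean.
deviations <- sugar - mean(sugar)

# Standard deviation: a single number, in the same units as sugar.
sugar_sd <- sd(sugar)

# sd() equals the square root of the summed squared deviations over (n - 1).
manual_sd <- sqrt(sum(deviations^2) / (length(sugar) - 1))

print(deviations)
print(sugar_sd)
```

With the real data, the same calls would be applied to the sugar column of the cereal data frame.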
Now, compute the deciles of the variable score:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
##  18.0  28.0  31.0  34.5  37.0  40.0  42.0  48.0  53.0  58.0  84.0
What is the value of the 30th percentile? Describe what this means in words:
Answer: 34.5. It means that 30% of the cereals have a score at or below 34.5.
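Deciles can be computed with base R's `quantile()` by asking for probabilities 0, 0.1, ..., 1. The sketch below uses a made-up `score` vector (hypothetical values, not the cereal data) and also checks what share of the observations falls at or below the 30th percentile:

```r
# Hypothetical health scores; NOT taken from the cereal dataset.
score <- c(18, 25, 29, 31, 33, 36, 38, 40, 41, 45, 52, 58, 84)

# probs = 0, 0.1, ..., 1 gives the deciles.
deciles <- quantile(score, probs = seq(0, 1, by = 0.1))
print(deciles)

# The 30th percentile, and the share of observations at or below it.
p30 <- deciles[["30%"]]
share_at_or_below <- mean(score <= p30)
print(p30)
print(share_at_or_below)
```

With a small sample the share is only roughly 30%, because `quantile()` interpolates between neighbouring observations by default.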
Produce a boxplot of score and brand.
Which brand seems to have the healthiest cereals?
Answer: Kelloggs
Produce a boxplot of score and shelf.
Produce a boxplot of sugar and shelf.
If I want a healthy but reasonably sweet cereal which shelf would be the best to look on?
Answer: Top Shelf
Next, we will take another look at a dataset of tea reviews that I used in a previous lecture:
With variables:
- name: the full name of the tea
- type: the type of tea. One of:
  - black
  - chai
  - decaf
  - flavors
  - green
  - herbal
  - masters
  - matcha
  - oolong
  - pu_erh
  - rooibos
  - white
- score: user rated score; from 0 to 100
- price: estimated price of one cup of tea
- num_reviews: total number of online reviews
Draw a scatterplot with num_reviews (x-axis) against score (y-axis) and add a regression line (recall: geom_smooth(method="lm")).
Does the score tend to increase, decrease, or remain the same as the number of reviews increases?
Answer: Increase
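The line drawn by `geom_smooth(method="lm")` is a least-squares fit of y on x, so its direction can also be read off the slope from `lm()`. A minimal sketch with made-up values (hypothetical, not the tea data):

```r
# Hypothetical review counts and scores; NOT taken from the tea dataset.
num_reviews <- c(20, 100, 350, 500, 800, 1100, 1600)
score <- c(90, 91, 92, 92, 94, 94, 95)

# Fit score as a linear function of num_reviews.
fit <- lm(score ~ num_reviews)

# A positive slope means score tends to increase with num_reviews;
# a negative slope means it tends to decrease.
slope <- coef(fit)[["num_reviews"]]
print(slope)
```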
Calculate the ventiles of the variable price.
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9%
## 8.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00
## 10% 11% 12% 13% 14% 15% 16% 17% 18% 19%
## 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00
## 20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
## 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 12.00 12.00
## 30% 31% 32% 33% 34% 35% 36% 37% 38% 39%
## 12.00 12.00 12.00 12.00 12.00 12.00 12.00 12.00 12.00 12.00
## 40% 41% 42% 43% 44% 45% 46% 47% 48% 49%
## 12.00 12.00 12.00 12.00 12.00 12.00 12.00 12.00 12.00 12.00
## 50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
## 13.00 14.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00
## 60% 61% 62% 63% 64% 65% 66% 67% 68% 69%
## 15.00 15.00 15.00 15.00 15.00 17.00 17.00 17.00 17.32 19.00
## 70% 71% 72% 73% 74% 75% 76% 77% 78% 79%
## 19.00 19.00 19.00 19.00 19.38 20.00 22.00 24.49 25.00 27.46
## 80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 30.00 32.00 32.00 32.00 34.00 35.35 38.64 40.00 44.56 46.93
## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%
## 49.30 60.00 60.00 60.82 71.02 86.75 95.04 137.38 157.00 157.00
## 100%
## 196.00
What is the 80th percentile? Describe it in words, include the units of the problem in your answer.
Answer: The 80th percentile is 30: 80% of the teas have an estimated price per cup of 30 or less, in the units of the price variable.
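The output above is in 1% steps (percentiles). Ventiles split the data into 20 groups, so the cut points are in 5% steps. A minimal sketch, assuming a made-up `price` vector (hypothetical values, not the tea data):

```r
# Hypothetical prices; NOT taken from the tea dataset.
price <- c(8, 10, 10, 12, 12, 13, 15, 15, 17, 19,
           20, 25, 30, 35, 49, 60, 157, 196)

# Ventiles: probs = 0, 0.05, ..., 1 (21 cut points, 20 groups).
ventiles <- quantile(price, probs = seq(0, 1, by = 0.05))
print(ventiles)

# The 80th percentile is one of the ventile cut points.
p80 <- ventiles[["80%"]]
print(p80)
```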
Plot the number of reviews (x-axis) against the score variable. Color the points according to price binned into 5 buckets.
What tends to be true about the number of reviews for the most expensive 20% of teas?
Answer: They tend to have fewer reviews than the other teas.
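One way to bin a variable into 5 equally sized buckets is to cut it at its quintiles. The sketch below does this in base R on a made-up `price` vector (hypothetical values, not the tea data); ggplot2's `cut_number(price, 5)` is a likely alternative if the course uses it.

```r
# Hypothetical prices; NOT taken from the tea dataset.
price <- c(8, 10, 11, 12, 13, 15, 17, 19, 25, 30, 35, 49, 60, 157, 196)

# Quintile cut points: 0%, 20%, ..., 100%.
breaks <- quantile(price, probs = seq(0, 1, by = 0.2))

# include.lowest = TRUE keeps the minimum price in the first bucket.
price_bucket <- cut(price, breaks = breaks, include.lowest = TRUE)

# Each bucket should hold about 20% of the observations.
print(table(price_bucket))
```

Note that `cut()` requires unique break points, so this approach can fail when many prices are tied at a quintile boundary.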
Create a dataset named white that consists of only white teas.
## # A tibble: 17 x 5
##    name                  type  score price num_reviews
##    <chr>                 <chr> <int> <int>       <int>
##  1 silver_needle         white    95    64         963
##  2 jasmine_silver_needle white    96    49         678
##  3 white_symphony        white    94    32         577
##  4 white_peach           white    95    19        1669
##  5 white_strawberry      white    94    19         488
##  6 white_peony           white    93    25        1113
##  7 white_blueberry       white    94    19        1353
##  8 white_eternal_spring  white    92    19         340
##  9 white_pear            white    91    19         814
## 10 white_darjeeling      white    94    46         107
## 11 white_fuzzy_navel     white    93    19          17
## 12 white_grapefruit      white    92    19         379
## 13 white_pearls          white    92    20          49
## 14 white_tropics         white    90    19         620
## 15 white_tangerine       white    90    19         495
## 16 snowbud               white    93    32         575
## 17 white_cucumber        white    88    19         521
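Subsetting by type can be sketched as follows. The three white teas are copied from the table above; the two non-white rows are made up (hypothetical) so that the filter has something to drop.

```r
# Three white teas from the table above, plus two HYPOTHETICAL
# non-white rows added only for illustration.
tea <- data.frame(
  name = c("silver_needle", "white_peach", "snowbud",
           "example_black", "example_green"),
  type = c("white", "white", "white", "black", "green"),
  score = c(95, 95, 93, 94, 92),
  price = c(64, 19, 32, 17, 12),
  num_reviews = c(963, 1669, 575, 1000, 700),
  stringsAsFactors = FALSE
)

# Keep only the rows whose type is "white".
white <- tea[tea$type == "white", ]
print(white)
```

With dplyr loaded, `filter(tea, type == "white")` would express the same step.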
Calculate the standard deviation of the price for white teas and the standard deviation of the price for all of the teas.
## [1] 13.59444
Is the variation of the white tea prices smaller, larger, or about the same as the entire dataset?
Answer: The variation of the white tea prices is smaller than that of the entire dataset.
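Only one value is printed above. It can be reproduced from the 17 white tea prices listed in the earlier table, which suggests it is the white tea standard deviation. The standard deviation for all teas does not appear in the output and would need the full dataset (for example `sd(tea$price)`, assuming the data frame is named `tea`).

```r
# Prices of the 17 white teas, copied from the table printed earlier.
white_price <- c(64, 49, 32, 19, 19, 25, 19, 19, 19,
                 46, 19, 19, 20, 19, 19, 32, 19)

# Sample standard deviation of the white tea prices.
white_sd <- sd(white_price)
print(white_sd)
```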
Summarize the dataset by the type of tea and save the results as a variable named tea_type.
## # A tibble: 12 x 14
##    type   score_mean price_mean num_reviews_mean score_median price_median
##    <chr>       <dbl>      <dbl>            <dbl>        <dbl>        <dbl>
##  1 black        93.7       23.4              995         94.0         17.0
##  2 chai         93.3       12.0             1069         93.0         12.0
##  3 decaf        93.2       15.0              303         94.0         15.0
##  4 flavo…       92.0       10.0              891         92.0         10.0
##  5 green        93.0       17.9              668         93.0         12.0
##  6 herbal       93.2       11.6              916         93.0         12.0
##  7 maste…       94.6        124              115         95.0          142
##  8 matcha       91.0       60.0              108         92.0         60.0
##  9 oolong       93.5       28.9              636         94.0         30.5
## 10 pu_erh       91.6       20.6              473         92.0         15.0
## 11 rooib…       92.3       11.7              509         92.5         10.0
## 12 white        92.7       26.9              633         93.0         19.0
## # ... with 8 more variables: num_reviews_median <dbl>, score_sd <dbl>,
## #   price_sd <dbl>, num_reviews_sd <dbl>, score_sum <int>,
## #   price_sum <int>, num_reviews_sum <int>, n <int>
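A per-type summary like the one above can be approximated in base R with `aggregate()`. This is only a sketch on made-up rows (hypothetical values, not the tea data), producing just the mean columns and the group size `n`; it is not the helper that generated the 14-column table.

```r
# Hypothetical teas; NOT taken from the tea dataset.
tea <- data.frame(
  type = c("black", "black", "green", "green", "green", "white"),
  score = c(94, 92, 93, 91, 95, 90),
  price = c(20, 30, 10, 12, 14, 19),
  stringsAsFactors = FALSE
)

# Mean score and mean price for each type of tea.
tea_type <- aggregate(cbind(score, price) ~ type, data = tea, FUN = mean)
names(tea_type) <- c("type", "score_mean", "price_mean")

# Number of teas in each type, matched to the rows by type name.
tea_type$n <- as.vector(table(tea$type)[as.character(tea_type$type)])
print(tea_type)
```

The `n` column is what a plot would use to size the points by the number of teas in each category.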
Plot the average price (x-axis) against the average score (y-axis) of each type of tea. Make the size of the points proportional to the number of teas in each category and label the points with geom_text_repel and the tea type.
Describe an interesting pattern or set of outliers that you found in the previous plot. This does not need to take more than 1-2 sentences.
Answer: The outliers are the matcha and masters teas. Both have a much higher average price than the other types, but they sit at opposite extremes in average score: masters has the highest (94.6) and matcha the lowest (91.0).